-
-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache entire public suffix list. Select at runtime. #144
Cache entire public suffix list. Select at runtime. #144
Conversation
d308b92
to
e31272a
Compare
@brycedrennan this is great -- i do suspect it could be backwards compatible without that being a significant setback. it would be good to be able to proceed with this without it being a major version upgrade/ breaking people do you have things fresh enough in your mind to take a whack at making it compatible (and ensuring TravisCI is passing)? if not, let me know, and i can take a turn |
@brycedrennan also --
appreciate you trying to update the snapshot but it appears that you deleted and did not commit the new one? |
It just doesn’t have new lines anymore. It’s a different format. |
@brycedrennan ohh right, now I see
@john-kurkowski thoughts? |
marked as |
@hangtwenty It's been a while and I'll dig into it later but my recollection is that this has to be a breaking change. If we increment the version properly to 3.0.0 that shouldn't cause anyone problems. The reason it needs to be a "breaking" change is that the previous behavior was that the first call of tldextract per machine would forever specify whether or not private domains were included for all calls afterwords. This made it impossible to use tldextract in both modes simultaneously, but also would return unexepected results since choices made at runtime were ignored in favor of whatever was in the cache. For example: extract_private = tldextract.TLDExtract(include_psl_private_domains=True)
extract_public = tldextract.TLDExtract(include_psl_private_domains=False)
# these shouldn't be the same but will be currently
extract_private('waiterrant.blogspot.com`) == extract_public('waiterrant.blogspot.com') I'll also look at the CI build failures, but if my memory serves me, they were inconsistently passing. I'll revisit it though. |
78a546e
to
3f8c901
Compare
@brycedrennan I understand that this is a fix resulting in different (more correct) behavior. And, understood that if someone has been hitting undefined behavior, they could be subtly depending on undefined/unspecified behavior, and have a change now. sometimes in such a case it's good to force people to break to be aware of the update. i don't think it's the right way to do it here though. (can discuss further though... that's just my initial sense) The nature of a cache file ... it does not have to be breaking. Old snapshot or cache can be ignored. Will be using new cache keys so new way will Just Work. As for moving the argument from the TLDExtract constructor to the call, it does not need to be a breaking change either, both call signatures could be supported. That's not dissimilar from the Am I making sense? |
I do think the JSON aspect needs to be changed before merge as well (unless someone does some benchmarking to confirm it's OK, but TBH that seems like a yak-shave and it would be better to just make the change) So the two criteria that are pending, before merge:
|
It's also breaking because ExtractResult has been changed. |
Ah, right. Need to think about whether to keep that part ... I can see the pros but I can also see the cons... thinking about subclassing namedtuple and making |
I can take a stab at making the changes non-breaking — LMK |
Sorry to not have more time to respond.
|
On my computer, loading the json data file takes less than 1ms. import time
def test_load_cachefile():
start = time.perf_counter()
num = 1000
for _ in range(num):
data = TLDExtract._get_snapshot_tld_extractor()
duration = time.perf_counter() - start
print('')
per_load = (duration*1000)/num
print('%s ms per load' % per_load)
|
@brycedrennan thanks for doing that check! You raise a good point there in ((2)), callers can ensure they aren't doing redundant inits, so it does not need to be overengineered here. |
If you haven't already. Gonna give it a go at addressing your feedback. |
@brycedrennan agreed and yeah I was thinking about that. It could still be technically a breaking change if downstream relied on but! maybe we can do it with duck-typing and then also faking the type a bit. While faking the type seems like a smell (overriding multiple dunder methods ( Notes on what methods: https://stackoverflow.com/questions/6803597/how-to-fake-type-with-python |
@brycedrennan or maybe there's even a way to stick with 'real' named tuples and only the tiniest bit of type-manipulation.
@brycedrennan what do you think? if it seems basically sane, I'm happy to take a whack at it |
Sounds good. Can either of you summarize the breaking changes & upgrade steps in CHANGELOG.md in this PR? Put it at the top in a new "Unreleased" section? I don't normally update CHANGELOG.md outside of |
.gitignore
Outdated
@@ -7,3 +7,9 @@ tldextract_app/tldextract | |||
tldextract_app/web | |||
tldextract.egg-info | |||
.tox | |||
.idea |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: can we keep this file focused on just what this project needs to ignore? Personal preferences e.g. editor or platform should be in a personal .gitignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know I could do that - thanks!
README.md
Outdated
ExtractResult(subdomain='', domain='127.0.0.1', suffix='') | ||
``` | ||
|
||
If you want to rejoin the whole namedtuple, regardless of whether a subdomain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no way to do this now?
I'd be okay dropping the whole idea of storing source and thus avoiding all the complications. |
@brycedrennan Agree, I can't convince myself any of the complicated options are worth going magical/weird for, as I continue thinking about it. Glad we discussed them though, thank you! I had another 💡 today -- Maybe we can get the best of both worlds
Or another way to 'get the best of both worlds' would be to do some thoughtful logging support. Either of those ways could be punted to another issue IMHO, we could drop Thoughts? |
So I’ve been working on a larger refactor and realized that having the plain text version of the list in the repo is super useful for testing. So im backtracking despite my previous arguments. I also think we can make something that acts like a named tuple and have it not be horrible. So current plan for this PR:
Adding back source, and other improvements can be in a different PR. |
@brycedrennan cheers, sounds good. Also nice bumping into you on the Poetry repo haha |
d9ca293
to
afc4066
Compare
Okay made the changes. I switched the snapshot to be just a copy of the raw text. I thought that this was the format before my changes, but looking at the history, it was not. I would argue that having it be the same content that we pull from the public server will allow us to simplify some code in the future and use the raw file in tests. (I've already made those changes in another, not yet pushed, branch) If everyone is okay with these recent changes I'll squash the latest commit into the first one so we have a cleaner commit history. |
@brycedrennan Last commit looks great — looking at the full diff I am wondering if there’s something else to change — but I’m multitasking right now and it could be PEBCAK on my end. I’ll poke around again tonight |
1427048
to
f946a74
Compare
Addresses john-kurkowski#66 but is not backwards compatible.
Handles concurrency issues.
f946a74
to
010a3af
Compare
Holy bump from the past. Thanks for your patience on this one. 😓 I merged in latest |
Superceded by #207 |
This is a second attempt at doing what was done in #144. Addresses #66. - Add `include_psl_private_domains` to the `__call__` method. This is now something you can choose on a per-call basis. The object level argument now is only a default value for each call. - The entire dataset from publicsuffix.org is saved to cache - Ensure no weird cache issues happen when using with different `suffix_list_urls` by using different filenames per `suffix_list_urls` - Use filelock to support multiprocessing and multithreading use cases - Update bundled snapshot to be the raw publicsuffix data. Need to look at performance impact of this. - Breaking change `cache_file` => `cache_dir`
Addresses #66 but is not backwards compatible.
Changes
include_psl_private_domains
to the__call__
method. This is now something you choose on a per-call basis.suffix_list_urls
by using different filenames persuffix_list_urls